ABSTRACT


An examination and comparison of pilot rating scales presently 
in use and an investigation into the possibilities of a linear rating 
scale were conducted. The hypothesis was advanced that a rater 
may transpose his impression of performance directly to a
non-adjectival, non-ordinal rating scale and thereby relate his
psychological continuum to a numerical index. Experimental data,
though limited, tended to support this hypothesis.


I. INTRODUCTION


With the advent of flight vehicles with operating envelopes ranging
from terra firma to the threshold of space and beyond, the
environmental and dynamic spectrums encountered on a single flight are
all-encompassing. Man is the low frequency response component
[Ref. 1] in the overall closed-loop man-machine system; therefore,
control systems must be designed within manageable limits. In short,
the effort expended in vehicle control must be minimized so that the
pilot may be free to complete other duties in the cockpit.

Consequently, the suitability of a machine system to serve its
intended mission is ultimately determined by a series of evaluations.
The most difficult of these assessments occurs at the man-machine
system interface.

Pilot evaluation of handling qualities determines the suitability 
of the machine system, yet there remains to be found a set of 
universally acceptable parameters for this evaluation. The complete 
nature of a pilot's task, work load, mental stress and acuity have 
not been described in any form of analytically determined transfer 
function or performance index [Ref. 2]. It is assumed, however,
that there exists a relationship between pilot comment and
performance and/or vehicle handling qualities.

Efforts to standardize the qualitative aspects of language into
a quantitative handling quality rating have been made. It is the
purpose of this study to examine and compare the rating scales
presently in use and to investigate the possibilities of a linear rating
scale with its inherent advantages.





II. HISTORY


A. EARLY DEVELOPMENTS

During the early 1930's when aviation was maturing, the need to 
delineate acceptable aircraft parameters was recognized. Consequently, 
a "check list" for this purpose was proposed by Edward P. Warner
[Ref. 3]. Subsequent work by Soulé and by R. R. Gilruth at the
Langley Laboratory of NACA condensed these requirements and a
set of specifications for military aircraft acceptance eventually 
resulted [Ref. 4]. 

After this initial breakthrough in establishing aircraft
specifications, emphasis was placed on devising pilot opinion ratings
aimed at these [illegible]. The concept of a general pilot rating
received little attention.


B. COOPER SCALE

In 1957 at the annual meeting of the Flight Testing Session, 
Institute of Aeronautical Sciences, Ames' Chief Research Pilot George 
E. Cooper introduced a generalized pilot rating scale which enjoyed 
immediate and almost total acceptance [Ref. 5]. This epochal scale
(Fig. 1) synthesized the previous work of NACA Langley and thereby 
provided an authenticated scale which could be applied to any aircraft 
handling qualities evaluation. It was the first rating scale to associate 
the qualitative nature of pilot opinion with a quantitative index. 

In applying this scale, it was recommended that the evaluator
pay particular attention to question formulation (Fig. 2). The





OPERATION   CATEGORY        NUMERICAL  DESCRIPTION                          PRIMARY
                            RATING                                          MISSION

NORMAL      Satisfactory    1          Excellent, includes optimum
OPERATION                   2          Good, pleasant to fly
                            3          Satisfactory, but with some mildly
                                       unpleasant characteristics

EMERGENCY   Unsatisfactory  4          Acceptable, but with unpleasant      Yes
OPERATION                              characteristics
                            5          Unacceptable for normal operation    Doubtful
                            6          Acceptable for emergency             Doubtful
                                       condition only*

NO          Unacceptable    7          Unacceptable even for emergency      Doubtful
OPERATION                              condition*
                            8          Unacceptable - Dangerous             No
                            9          Unacceptable - Uncontrollable        No

            Unprintable     10         Motions possibly violent enough
                                       to prevent pilot escape

*Failure of stability augmenter

COOPER SCALE

FIGURE 1


[heading illegible]                  Selection of proper task or maneuver
                                     Ambiguity - Pilot responsibility
                                     Use of words

ANSWERING THE QUESTION               Standardized rating system

WEIGHING THE ANSWER BY
CONSIDERATION OF PILOT'S
    PRESENT VIEWPOINT                Current duties and responsibilities
    EXPERIENCE                       Training - Types of aircraft flown
    ADAPTABILITY                     Human resourcefulness and capacity

INITIAL IMPRESSIONS

LEARNING CURVE

PILOT WORK

USE OF GROUND SIMULATORS             As a research tool
                                     Invalidating assumptions
                                     Pilot responsibility

FACTORS AFFECTING PILOT OPINION

FIGURE 2





question had to be sufficiently specific so as to minimize interpreta- 
tion and ambiguity. 

The pilot, in answering the question, was required to channel his 
exposure, sensations and reactions into the scale vocabulary by first 
considering four handling qualities categories: Satisfactory,
Unsatisfactory, Unacceptable and Unprintable. As may be noted from
Figure 1, these categories were separated, for description purposes,
at the approximate values 3.5, 6.5 and 9.5 respectively. Within each
category, the pilot was required to further define his opinion in terms
of the scale vocabulary and a secondary mission (landing).
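
The category boundaries quoted above amount to a simple lookup. The following is a minimal Python sketch, assuming only the four category names and the boundary values 3.5, 6.5 and 9.5 given in the text; the function name `cooper_category` is hypothetical and is not part of the original scale documentation.

```python
from bisect import bisect_right

# Category boundaries and names as quoted in the text for the Cooper Scale.
BOUNDARIES = [3.5, 6.5, 9.5]
CATEGORIES = ["Satisfactory", "Unsatisfactory", "Unacceptable", "Unprintable"]


def cooper_category(rating: float) -> str:
    """Return the Cooper Scale category for a numerical rating from 1 to 10."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must lie between 1 and 10")
    # bisect_right counts how many boundaries lie at or below the rating,
    # which is exactly the index of the category the rating falls in.
    return CATEGORIES[bisect_right(BOUNDARIES, rating)]
```

Under these assumptions, the patrol pilot's rating of 8 in the example that follows falls in the Unacceptable category, and the fighter pilot's rating of 3 falls in the Satisfactory category.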

Once the pilot had formulated his opinion with respect to the scale, 
his evaluation had to be weighted in consideration of his viewpoint, 
experience and adaptability. For example, a patrol pilot might
evaluate the stall-associated buffet and departure in a fighter as
"Unacceptable-Dangerous" (numerical rating 8); whereas, a fighter pilot
might evaluate the same characteristics as "Satisfactory, but with some
unpleasant characteristics" (numerical rating 3). Then, with some
exposure, the same two pilots might reevaluate the characteristics
at 4 and 2 respectively. The rating scale was, therefore, very subject
to experience and adaptability. To eliminate this deficiency and to 
provide some measure of consistency, it was suggested that the scale 
be used only by test pilots. 

Though the Cooper Scale had claim to primacy, it was ambiguous 
in its definitions and complicated in that it placed stipulations on pilot 
opinion. It would appear that the scale was designed to evaluate
aggregate handling qualities.


C. HARPER SCALE

Robert P. Harper, Jr. used a pilot opinion scale (Fig. 3) for 
evaluating the handling qualities of a variable stability aircraft in 
1959 [Ref. 6]. The Harper Scale was developed honoring the stipula- 
tions of question formulation but with a concept quite different from 
the Cooper Scale. Harper was interested in evaluating pilot-vehicle 
performance, but found this extremely difficult because of pilot 
adaptability. Instead, a scale was devised to evaluate pilot opinion 
with respect to alterations in the stability derivatives and thereby
arrive at a pilot preference: a most suitable aircraft stability.

To ensure reliability and compensate for scale vocabulary de- 
ficiencies, test pilots wire-recorded their subjective comments 
during the evaluation and recorded their scale rating following each 
evaluation. This was, perhaps, the best aspect of the testing 
procedure. The pilot rating was kept simple and subordinate to the 
subjective evaluation. Because of this reliance on subjective
comments made during the tests, the pilot rating was utilized as a
cursory index to the evaluation and not as an end in itself.

In evaluating the handling qualities with respect to the rating 
scale, the pilot considered four handling qualities categories: 
Acceptable and Satisfactory, Acceptable but Unsatisfactory, Unaccept- 
able, and Unflyable. The separation between these categories 
occurred at 3.5, 6.5 and 9.5 respectively. Within each category, 
the pilot further defined his opinion in terms of a single, though
sometimes ambiguous, adjective (Fig. 3).
CATEGORY              ADJECTIVE DESCRIPTION      NUMERICAL
                      WITHIN CATEGORY            RATING

Acceptable            Excellent                  1
and                   Good                       2
Satisfactory          Fair                       3

Acceptable            [illegible]                4
but                   [illegible]                5
Unsatisfactory        [illegible]                6

Unacceptable          Bad*                       7
                      Very bad**                 8
                      Dangerous***               9

Unflyable             Unflyable                  10

*Requires major portion of pilot's attention
**Controllable only with a minimum of cockpit duties
***Aircraft just controllable with complete attention

HARPER SCALE

FIGURE 3





D. HARPER SCALE ADAPTATIONS

In contrast to the Cooper Scale, the Harper Scale (often cited 
as the Cornell or CAL Scale because of its extensive use by Cornell 
Aeronautical Laboratory, Inc.) was designed as an index for
evaluating particular and highly restricted handling qualities. Efforts to
adapt the CAL Scale to the evaluation of aggregate handling qualities
met with varied success. 

One such example was the application made by Michael L. Parrag 
in 1967 [Ref. 7] in studying the effects on handling qualities of
higher-order response characteristics against a background of varying
conditions and associated mission tasks. 

To facilitate more reliable and consistent pilot comments, the 
test pilots were provided with a comment check list for the two flight 
conditions (Fig. 4), and instructed to make subjective comments 
following each test run. After all tasks were completed, a compre- 
hensive subjective report was required incorporating all the salient 
features of each configuration. Finally, an objective report using 
the comment check list was made. 

Here, as in Ref. 4, emphasis was placed on subjective comments.
Task-oriented objective comments were used to provide consistency
and point out features of each task which might otherwise have been
overlooked. Although the CAL Scale was used as an index to pilot
opinion, it was, for all practical purposes, insignificant in evaluating
the handling qualities investigated.


APPROACH COMMENT CARD

1. IS THE AIRPLANE DIFFICULT TO TRIM?

2. IS ATTITUDE CONTROL SATISFACTORY?
   TENDENCY TO OVERCONTROL?

3. IS MAINTAINING ALTITUDE A PROBLEM?
   a) STRAIGHT AND LEVEL
   b) TURNS

4. IS MAINTAINING AIRSPEED A PROBLEM?

5. WERE GLIDE SLOPE ERRORS EASILY CORRECTED?
   WAS IT DIFFICULT TO MAINTAIN GOOD GLIDE
   SLOPE CONTROL?

6. WHAT INSTRUMENTS ARE YOU USING MOST?

7. COULD YOU MAKE AN INSTRUMENT LANDING APPROACH
   WITH THIS CONFIGURATION AT THE SPEED OF 125 KNOTS?

8. PILOT RATING - ADJECTIVES - NUMBER - WHY?


HIGH ALTITUDE COMMENT CARD

1. IS THE AIRPLANE DIFFICULT TO TRIM?

2. [illegible] CONTROL SATISFACTORY? TENDENCY TO
   OVERCONTROL?

3. IS NORMAL ACCELERATION CONTROL A PROBLEM?

4. IS HOLDING ALTITUDE A PROBLEM?
   a) STRAIGHT AND LEVEL
   b) TURNS

5. WERE THERE ANY DIFFICULTIES IN FLIGHT PATH CONTROL
   DURING THE CLIMBING AND DESCENDING TURNS?

6. [illegible] PROBLEMS ASSOCIATED WITH THE
   TRACKING TASK?

7. PILOT RATING - ADJECTIVES - NUMBER - WHY?


PILOT COMMENT CARDS

FIGURE 4


E. COOPER-HARPER SCALE

With wide and independent usage of the Cooper and Harper Scales
the problems previously cited for each were sources of confusion in
application. It became increasingly apparent that an acceptable
composite rating system incorporating the best features of each
scale would be advantageous.

To this end Cooper and Harper jointly advanced a revised rating 
scale in 1966 [Ref. 8]. This scale (Fig. 5), hereafter referred to
as the Cooper-Harper Scale, enjoyed general acceptance and
preference over the previous scales; however, the various implementing
institutions voiced a need for clarification in semantics and in
application. In 1969 an explicitly comprehensive joint report was published
to modify and clarify the Cooper-Harper Scale [Ref. 9]. The report
precisely defined flight evaluation terminology and discussed the 
aspects of question formulation and scale data application. 

Based on the voluminous data and comments available from
international audiences of the Cooper and Harper Scales, the
Cooper-Harper Scale was excellently designed as a dichotomous procedure
of evaluation. A pilot, in evaluating a handling quality, systematically
chose between two alternatives which channeled his consideration into
a rating category or into another dichotomous decision with the same
channeling result. Through this simplified procedure (compare with
the relative complexity of previously discussed procedures) three of
four existing categories were eliminated without ever considering
the applicable descriptive adjectives.
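
The channeling effect of the dichotomous procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three yes/no parameters paraphrase the kind of questions the procedure poses and are not quoted from Ref. 8 or Ref. 9, the function name `rating_band` is hypothetical, and the bands 1-3, 4-6, 7-9 and 10 assume the category boundaries of 3.5, 6.5 and 9.5 carried over from the earlier scales.

```python
def rating_band(controllable: bool,
                adequate_performance: bool,
                satisfactory: bool) -> range:
    """Channel three yes/no answers into a band of numerical ratings.

    Each "no" answer ends the procedure in one category; each "yes"
    passes the pilot on to the next two-way decision.
    """
    if not controllable:
        return range(10, 11)   # rating 10 only
    if not adequate_performance:
        return range(7, 10)    # ratings 7 to 9
    if not satisfactory:
        return range(4, 7)     # ratings 4 to 6
    return range(1, 4)         # ratings 1 to 3
```

Only after a band has been reached would the pilot consult the descriptive adjectives to choose a single rating inside it, which is why three of the four categories drop out before any adjective is read.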

The inverted ten-point scale was retained in the interests of 
consistency. An ordinal sequence varying in magnitude with the 
degree of "goodness" would seem more appropriate; however,
audiences of the previous scales had become accustomed to the
inverted scale and a reordering of the numerical indices would have
resulted in unnecessary confusion. To further ease the transition 
from previous scales, the boundaries of 3.5, 6.5 and 9.5 were 
retained. 

It would appear that a satisfactory method for assessing the 
man-machine interface had been achieved; but not quite. Although 
the Cooper-Harper Scale continues to be the most widely used 
evaluation system to date, it remains insensitive at the bad end and 
does not exhibit the desirable feature of linearity. Linearity is that
feature of a rating scale which will allow the averaging of data
ensembles without distorting the data sample interpretation.
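
A small numerical illustration may help show why averaging on a non-linear scale distorts interpretation. The sketch below is hypothetical throughout: the five-point scale, the unequally spaced positions assigned to each rating, and the helper name `nearest_rating` are assumptions chosen only to make the effect visible, and they do not describe any scale reviewed here.

```python
from statistics import mean

# Hypothetical positions of ratings 1..5 on an underlying continuum.
# The unequal spacing is an assumption made purely for illustration.
POSITION = {1: 0.0, 2: 1.0, 3: 2.0, 4: 6.0, 5: 10.0}


def nearest_rating(position: float) -> int:
    """Return the rating whose assumed position is closest to the given value."""
    return min(POSITION, key=lambda r: abs(POSITION[r] - position))


ratings = [1, 5]                                     # two widely differing opinions
mean_of_indices = mean(ratings)                      # averaging the printed numbers
mean_position = mean(POSITION[r] for r in ratings)   # averaging on the continuum
rating_of_mean_position = nearest_rating(mean_position)
```

Averaging the printed indices gives 3, whereas averaging the assumed underlying positions gives 5.0, which lies nearest to rating 4. On a linear scale the two averages would agree, which is the property sought in the work described next.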


F. MCDONNELL SCALE

In 1968, J. D. McDonnell published his study of rating techniques
[Ref. 10]. His objective was to evolve a rating scale which had an 
underlying linear structure to facilitate mathematical operations on 
pilot data. This underlying structure required the discipline of 
psychophysics for determination. 

Although a detailed examination of psychophysics is beyond the
scope of this study, the basic theory is presented for clarification.
If an objective measure is made upon some object, the resulting data
must lie along some physical continuum. If an evaluator estimates
a measure, the measure is subjective and must lie along some
psychological continuum. The relationship between these two
continua, if it could be determined, would provide a means of
linearizing the subjective scale.





To establish an intervaled psychological continuum, a list of
sixty-four appropriately descriptive phrases was [illegible]
submitted to sixty-three raters. For each phrase, the raters were
instructed to indicate their impression of a hypothetical vehicle
so described on a plot with the end points of "most favorable" and
"least favorable". The data were then processed by the methods
of psychophysics and successive intervals and assigned a relative
standing on a scale of nine. The data were further reduced to the
arbitrary seven-point McDonnell Scale depicted in Figure 6.

The McDonnell Scale (often called the Global Rating Scale be- 
cause it related aggregate handling qualities) was, therefore, 
presumed to be a linear scale of constant subjective sensitivity 
reflecting the resolving power of raters to distinguish semantic 
differences. Because it was related to a seven-point scale in 
contrast to the ten point scales with which users were familiar, it 
was not accepted with any noticeable exuberance. 

The truly important contribution made by McDonnell was the
list of evaluation phrases related to an index of nine and reflecting
psychological sensitivity. The phrases were divided into six
categories: Handling Qualities, Control, Precision, Response
Characteristics, Effects of Deficiencies, and Demands on Pilot.
Through the use of this listing, specialized linear scales may be
constructed to satisfy particular rating requirements.


See APPENDIX B.
FAVORABILITY OF HANDLING QUALITIES

(vertical scale with end graduations 0 and 8 legible; the exact
positions of the phrases along the scale are not recoverable)

- Excellent
- Highly Desirable
- Good
- Fair
- Poor
- Bad
- Nearly Uncontrollable
- Uncontrollable

MCDONNELL SCALE

FIGURE 6
G. CONTEMPORARY RESEARCH

In designing the washout circuitry for the Ames All-Axis Motion
Generator, it became necessary and expedient to solicit pilot opinion
in determining the "best" set of parameters to use in a given
configuration. To this end, S. F. Schmidt and Bjorn Conrad [Ref. 11]
used three non-ordinal, relative rating scales in their evaluations
(Fig. 7-a, b, c).

The questions related to each scale were particularly tailored to 
the descriptive adjectives shown and they were simple in nature. By 
using pilot comments as an index, the design providing the best over- 
all simulator characteristics was obtained. However, moderate 
changes in the washout circuitry initially selected did not alter pilot 
opinion during subsequent testing. 

It would appear that one or both of the following factors were
responsible for the inability of rating pilots to distinguish minor
changes in simulator characteristics:


1. The evaluation task was insensitive to minor changes
in system response.


2. The rating scale adjectives were too widely separated
on the psychological continuum.


During a personal interview,* Conrad discussed the work on which
he had reported in Ref. 11. In determining the best washout circuitry
the pilot ratings extracted from his scales were heavily supplemented
with debriefs. It was primarily through this method of pilot interview
that the best washout circuitry was obtained.


Washout circuitry: that servo circuitry of an all-axis motion simulator
which provides for returning the simulator to its initial position after
being disturbed. It is important that this function be executed at a rate
below a pilot's sensing threshold.

*Interviewed on 10 May 1971 at Analytical Mechanics Associates,
Inc., Palo Alto, California.


7-a
[ ] EXCELLENT
[ ] GOOD
[ ] FAIR
[ ] POOR
[ ] UNACCEPTABLE

7-b
[ ] MORE DIFFICULT
[ ] SLIGHTLY MORE DIFF.
[ ] ABOUT THE SAME
[ ] LESS DIFFICULT
[ ] SUBSTANTIALLY EASIER

7-c
[ ] ALWAYS
[ ] OFTEN
[ ] OCCASIONALLY
[ ] RARELY
[ ] NEVER

7-d
[ ] MUCH HARDER
[ ] HARDER
[ ] SAME AS
[ ] EASIER
[ ] MUCH EASIER

CONRAD SCALES

FIGURE 7


He observed that pilots rapidly adapted to minor configuration
changes without altering their rating, and he described this lack of
sensitivity as a rating plateau (Fig. 8).





[Plot of Performance against Pitch Response]

Figure 8


He additionally noticed that a pilot's impression of his mean
performance changed from day to day. This, therefore, required
that at least one test run utilizing the "standard" washout circuitry
be conducted to reestablish the pilot's mean performance, a
time-consuming and costly procedure.

Conrad's present work, an extension of that above, tasks pilots
with flying formation on the television display of a
six-degree-of-freedom simulated tanker aircraft. It is his hope that this
relative position task will prove to be sufficiently sensitive and thereby
provide reliable pilot ratings on the scale depicted in Figure 7-d.


H. [heading illegible]

The rating scales which have been reviewed fall into the two
categories, as distinguished according to purpose, of aggregate and
relative handling qualities evaluations. The first category consists
of the Cooper and Cooper-Harper Scales; whereas, the latter consists 
of the Harper, McDonnell and Conrad Scales. 

During a personal interview,* Cooper related the circumstances
stimulating the evolution of his Scale. While evaluating a variable
stability F6F Hellcat, the project engineers had an understandable
tendency to mathematically manipulate the flight data in the course 
of its reduction; however, the conclusions derived therefrom did not 
necessarily reflect the pilot's interpretation of the actual handling 
qualities encountered. To eliminate this inadvertent misinterpreta- 
tion of flight data, the Cooper Scale was designed. 

When Cooper presented his Scale at the annual meeting of the 
Institute of Aeronautical Sciences it was immediately accepted and
internationally implemented as an aggregate evaluation scale. Though
the Cooper Scale was not designed for this purpose, international 
usage determined its application. 

In the collaborative effort to develop the Cooper-Harper Scale, 
Harper advocated a relative evaluation scale; however, the various 
implementing institutions preferred a scale applicable to aggregate 
evaluations and the dichotomous scale resulted. 

The Harper and Conrad Scales were obviously designed to 
evaluate relative handling qualities and no further discussion is 
necessary. 


The McDonnell (or Global) Scale was designed as an aggregate
rating scale; however, because of its syntactical simplicity it could
be applied only to relative evaluations (see Fig. 6). The sixty-three
psychologically intervaled phrases resulting from McDonnell's
research, however, were applicable to both aggregate and relative
handling qualities evaluations.


*Interviewed on 10 May 1971 at the Ames Research Laboratory,
NASA, NAS Moffett Field, California.

In evaluations utilizing any of the rating scales except the
Cooper-Harper Scale, subjective pilot comment was required to provide
meaningful evaluation data.


III. HUMAN RESPONSE


A. INTRODUCTION

The Cooper-Harper Scale was excellently designed and remains 
the best aggregate rating scale in existence because of its dichotomous 
nature and its acceptance as the international standard. However, it 
was specifically designed so as not to facilitate the averaging of 
ratings [Ref. 9].

With the advent of greater sophistication in aircraft research
and development, it has become increasingly important to evaluate
the relative "goodness" of aircraft components and subsystems. It
is assumed that a highly desirable aerospace vehicle may be designed
[illegible]; however, a rating scale capable of [illegible]
acceptance or rejection of one highly desirable system over another
is yet to be evolved. It is the purpose of this section to investigate
the possibility of such a rating scale.

For a scale to effectively reflect minor differences in performance,
extreme sensitivity is desired. The inherent advantages of linearity 
are also desired to facilitate mathematical operations on a limited 
ensemble and thereby suppress research and procurement costs. 

The hypothesis of this investigation is that a linear rating scale 
coincident with the psychological continuum begets sensitivity. The 
psychological continuum was investigated [Ref. 10] and resulted in the
McDonnell Scale, but, as may be noted from Figure 6, descriptive
adjectives and/or phrases did not align cardinally. This, then,
provided a source of confusion because the numerical value associated
with the adjective might not coincide with the rater's psychological
continuum. Were this source of syntactic confusion eliminated, the
rater could transpose his impression of performance directly to a
rating scale and thereby relate his psychological continuum to a
linear numerical index. Additionally, if the rater were allowed to
fractionalize his rating, sensitivity would be limited only by the
rater's discriminate dispersion* and frustrations.

To investigate this hypothesis, a simple puzzle was selected and 
submitted to the analytically inclined students in the Department of 
Aeronautics of the Naval Postgraduate School. Upon completion of 
the test, or at the expiration of an allotted time, the subjects were
asked to rate their impression of the difficulty they encountered in
working the puzzle on three numerical scales.


B. TEST EQUIPMENT

The plastic Kohner EVEN-STEVEN solitaire puzzle (Fig. 9) was 
used as the testing device. It consisted of a base with eight equal 
depth holes, eight equal length sleeves with variable interior depths, 
and eight variable length pegs. The puzzle had 40,320 (eight factorial)
different arrangements, one of which resulted in all pegs being even.
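
The figure of 40,320 can be checked directly. The short Python sketch below assumes that an arrangement means one assignment of the eight pegs to the eight sleeves and that every assignment is equally likely on a blind attempt; the variable names are illustrative only.

```python
from math import factorial

PEGS = 8

# Number of ways to assign eight distinct pegs to eight distinct sleeves.
arrangements = factorial(PEGS)

# Chance that a single blind attempt produces the one even arrangement,
# assuming every assignment is equally likely.
chance_of_success = 1 / arrangements
```

With only one arrangement in 40,320 being even, a subject cannot expect to finish within the allotted time by guessing alone, which is consistent with the use of the puzzle as a task of graded difficulty.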

A standard stop-watch was used for timing, and the scales
depicted in TABLE I were used for rating purposes.


C. TESTING PROCEDURE

Before starting the exercise, the subjects were briefed in detail
regarding the physical characteristics of the puzzle. Prior to each
test the pegs and sleeves were removed from the base and mixed
randomly within a box before the subject. The exercise was started on
the proctor's "mark" with the subject's hands poised over the box.


*The deviation of the resolving power of raters to distinguish
minor differences in performance.


TABLE I
RATER QUESTIONNAIRE

NAME ____________________   DATE ____________
AGE  ______

You are requested to solve the EVEN-STEVEN puzzle as a Human
Response Section of a Thesis. You will have 60 seconds in which to
complete the exercise. After completing, please indicate the degree of
Difficulty you encountered while performing the exercise.

DIRECTIONS:

1. Set the purple sleeves in the base.
2. Insert the orange pegs in the sleeves so that the tops are
   even (make STEVEN-EVEN).
3. Do not look into the sleeves for distance estimations.

TEST: 60 seconds

Below are three rating scales with the direction of Increasing
Difficulty as indicated. Use all three scales.

1. Indicate your impression of Difficulty in the box provided.
2. Indicate the scale you prefer; A, B or C.

SCALE A    0   1   2   3   4   5   6   7   8   9   10
SCALE B    0   1   2   3   4
SCALE C    [graduations only; numerals illegible]

At test completion the time was recorded or, if the subject did not 
complete the test in 60 seconds, the number of even pegs, regardless 
of height, was recorded. The elapsed time or number of even pegs 
was the basis for determining performance. 

The subject was then asked to rate his impression of the difficulty
he encountered in working the puzzle with respect to all three scales
of the RATER QUESTIONNAIRE (TABLE I), and to indicate his rating
in the box provided. This procedure was repeated twice to provide
three tests. When subjects inquired as to the degree of difficulty
associated with scale end points, they were briefed that this
determination was the rater's responsibility. By so doing, the rater's
personal psychological continuum was enjoined.


D. RESULTS AND DISCUSSION

The exercise was administered to thirty-one subjects as outlined 
above, and the raw data were recorded in Appendix A. Of the subjects 
tested, 25 or 80.8% understood the rating procedure. The remaining 
six failed to rate their impression of the difficulty they encountered 
as evidenced by their constant ratings on each scale, regardless of 
their performance, throughout the testing sequence. Consequently 
these data were discarded because it was impossible to determine 
the linear correlation of a point. 

1. Question Formulation and Interpretation 


The failure of 19.2% of the subjects to comprehend the rating
procedure may be the result of incorrectly written rating statements
("... please indicate the degree of Difficulty you encountered
while performing the exercise." and "Indicate your impression of
Difficulty in the box provided."). However, these statements were
combined during the pretesting brief (i.e., "Indicate your impression
of the Difficulty you encountered in working the puzzle.").

When this 19.2% was queried regarding their constant rating, all
replied that the difficulty of the test was a constant regardless of
their performance.

Subjects 7, 12 and 14 (TABLE II) all had inappropriately low
correlation factors because their ratings indicated increased difficulty
for increased performance. When each was queried, he related that
more incorrect puzzle combinations were discovered in subsequent
testing; consequently, his impression of puzzle difficulty increased.
Although these ratings did not properly reflect the rating statements,
they were used in the Linearity section because such deviations may
be expected in any testing procedure.

2. Linearity

Linear correlation assumes a linear relationship between
variables. If two variables are exactly linearly related, the
correlation factor will have a magnitude of 1.00. Deviations from
linearity will yield magnitudes less than 1.00.

To facilitate detailed analysis and to justify raw data averaging,
an individual correlation factor (r) was calculated for each exercise
subject listed in TABLE II. In correlation factor calculations the
time to exercise completion or the number of even pegs was used as
the independent variable, and the subject's rating was used as the
dependent variable.
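
The calculation described above can be sketched in Python. The sketch assumes that the correlation factor is the usual Pearson product-moment coefficient, which the text does not state explicitly; the function name `correlation_factor` and the sample data are hypothetical and are not taken from Appendix A or TABLE II.

```python
from math import sqrt
from typing import Optional, Sequence


def correlation_factor(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation of x (performance) and y (rating), or None if undefined."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least two")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant rating (or constant performance) has no spread, so no
        # linear correlation can be computed for it.
        return None
    return sxy / sqrt(sxx * syy)


# Hypothetical subject: completion times in seconds and ratings on Scale A.
times = [50.0, 40.0, 30.0]
ratings = [7.0, 5.0, 3.0]
r = correlation_factor(times, ratings)
r_constant = correlation_factor(times, [5.0, 5.0, 5.0])
```

In this hypothetical case the rating falls in step with the completion time, so the factor is 1.0. The second call returns `None`, which mirrors the reason given earlier for discarding the constant ratings: a set of identical ratings collapses to a single point for which no linear correlation can be determined.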


TABLE II

RATER PERFORMANCE - RATING CORRELATION FACTOR

SUBJECT        CORRELATION FACTORS (r)
               SCALE A     SCALE B     SCALE C

[The tabulated correlation factors for the individual subjects are illegible in the source.]





Scales A and B yielded correlation factors of which 90.9%
were greater than 0.8 and 81.8% were greater than 0.9. Scale C
yielded 77% and 72% respectively. The overall correlation factors
for Scales A, B and C were 0.928, 0.905 and 0.927 respectively.
This high degree of performance-rating correlation confirmed 
linearity and sensitivity, and was an extremely strong indication 
that raters were able to relate their personal psychological continuum 
to a linear, non-adjectival, non-ordinal rating scale. It additionally 
provided justification for the averaging of ratings. 
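
The percentages quoted above are shares of individual correlation factors exceeding a threshold. A minimal Python sketch of that tally follows; the function name `share_above` and the list of factors are hypothetical, since the individual values of TABLE II are not reproduced here.

```python
from typing import Sequence


def share_above(factors: Sequence[float], threshold: float) -> float:
    """Return the percentage of correlation factors strictly greater than threshold."""
    if not factors:
        raise ValueError("at least one correlation factor is required")
    count = sum(1 for r in factors if r > threshold)
    return 100.0 * count / len(factors)


# Hypothetical correlation factors, not the values of TABLE II.
example = [0.99, 0.95, 0.85, 0.60]
share_over_08 = share_above(example, 0.8)
share_over_09 = share_above(example, 0.9)
```

For the hypothetical list, three of the four factors exceed 0.8 and two exceed 0.9, giving shares of 75% and 50%.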

Another feature of high correlation is that relatively few 
trials may be conducted with a high degree of confidence in the
resulting data. This thereby reduces the time and cost expenditures 
associated with testing. 

3. Rating Analysis

The test subjects' ratings fell into two groups as characterized
by those who completed all tests during the allotted time (Group
X) and those who completed two or fewer tests (Group Y). As indicated
in Figure 10, Group X experienced less difficulty than Y throughout 
the testing sequence; however, the rating curves of Group X reflected 
decreased learning in contrast to the curves of Group Y. 

It should be noted that the rating curves of Group Y did not 
remain parallel as did those of Group X. This was, perhaps, an 
indication of the frustration experienced in not being able to complete
each test. Such a factor would influence rating accuracy and,
consequently, rating sensitivity.

By averaging the unweighted corresponding test ratings of
both Groups (there were more subjects in Group Y), Figure 11 was
constructed. As may be observed, the average rating curves ranged
about the numerical mean of each scale, and, in fact, the average
ratings of Scales A, B and C were 5.00, 2.02 and 5.02 respectively.

Considering these two facts, it must be assumed that the 
test subjects discarded any "degree of difficulty" associated with
the scale end points and related all of their ratings to the scale 
numerical mean. Consequently, all rating was a matter of judgment; 
a matter of relating their psychological continuum to the scales 
presented in TABLE I. Whether test subjects consciously or sub- 
consciously related to the scales' numerical means was beyond the 
scope of this study. 
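
The distinction between an unweighted average of the two group curves and an average weighted by group size can be made concrete. The following is a minimal sketch with hypothetical ratings and group sizes; the function names are illustrative and none of the numbers are taken from the experiment.

```python
def unweighted_average(curve_x, curve_y):
    """Test-by-test mean of two group curves, each group counted once."""
    if len(curve_x) != len(curve_y):
        raise ValueError("both curves must cover the same tests")
    return [(x + y) / 2 for x, y in zip(curve_x, curve_y)]


def weighted_average(curve_x, n_x, curve_y, n_y):
    """Test-by-test mean of two group curves, weighted by group size."""
    if len(curve_x) != len(curve_y):
        raise ValueError("both curves must cover the same tests")
    total = n_x + n_y
    return [(n_x * x + n_y * y) / total for x, y in zip(curve_x, curve_y)]


# Hypothetical mean ratings per test for each group.
group_x = [3.0, 4.0, 5.0]
group_y = [5.0, 6.0, 7.0]

unweighted = unweighted_average(group_x, group_y)
# Hypothetical group sizes: the larger group pulls the weighted curve
# toward its own ratings.
weighted = weighted_average(group_x, 10, group_y, 30)
```

With unequal group sizes the two results differ, which is why the text notes that the averaging was unweighted even though Group Y contained more subjects.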

4. Scale Preference 

Of the 31 test subjects, 28 preferred Scale A, two preferred 
Scale B, and one preferred Scale C. It was interesting to note that 
Scale A construction paralleled that of the Cooper-Harper Scale 
(i.e., increasing numerical index with increasing degree of ''badness''); 
however, only 35% of the test subjects had ever been exposed to the 
Cooper-Harper Scale. Because the subjects were enrolled in a 
mathematically oriented curriculum, the preference for a decimal 
scale based on ten seemed appropriate. As evidenced from the 
overall correlation factors and the Scale average ratings, the pre- 
ference for Scale A appeared valid. 

The preference for Scale B was believed to reflect 
exposure to the 4.0 Navy system. The preference for Scale C was 
believed to have been made in the interests of inconsistency and 

5. Recommendations 

Because of the high correlation experienced during this 
investigation and the preference of raters for a decimal scale based 
on ten, it is recommended that such a scale be used in all relative 
rating evaluations. 

Although the Cooper-Harper Scale is unequivocally accepted 
for its designed purpose, it could be improved if the scale advanced 
herein were used to evaluate the relative ''goodness'' within a Cooper- 
Harper ordinal category. For example, once an ordinal category 
were determined via the dichotomous procedure, the category could 
be further defined by Scale A utilization. To designate such a refine- 
ment, the first number of a series could reflect the non-averageable 
Cooper-Harper rating and subsequent numbers reflect Scale A 


me ie. 25). 
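
The recommended refinement can be sketched as a two-part record: an ordinal Cooper-Harper category that is never averaged, and a Scale A rating that is averaged only among ratings sharing the same category. The class and function names below are hypothetical, the sample values are invented, and the written notation for the combined rating is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CompositeRating:
    """Hypothetical pairing of an ordinal category with a linear refinement."""
    category: int      # Cooper-Harper rating, an ordinal value from 1 to 10
    refinement: float  # Scale A rating given within that category

    def __post_init__(self):
        if not 1 <= self.category <= 10:
            raise ValueError("Cooper-Harper ratings run from 1 to 10")


def mean_refinement_by_category(ratings):
    """Average the linear refinement separately for each ordinal category.

    The ordinal categories themselves are only used as grouping keys.
    """
    grouped = defaultdict(list)
    for rating in ratings:
        grouped[rating.category].append(rating.refinement)
    return {cat: sum(vals) / len(vals) for cat, vals in sorted(grouped.items())}


# Invented example: two raters chose category 3, one chose category 4.
sample = [
    CompositeRating(3, 2.0),
    CompositeRating(3, 4.0),
    CompositeRating(4, 5.0),
]
summary = mean_refinement_by_category(sample)
```

Keeping the two parts in separate fields prevents the ordinal category from being averaged inadvertently, while the linear part retains the averaging property claimed for Scale A.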


E. CONCLUSIONS 

The purpose of this study was to examine and compare the rating 
scales presently in use and to investigate the possibilities of a linear 
rating scale. 

A review of rating scale development and a study of the current 
rating scales were presented in Section II. Section II also provides 
an organized source of information for the rating-scale novice that 
may be used to develop specialized rating scales. 

Section III advanced with some substantiation the hypothesis that 
a rater may transpose his impression of performance directly to a 
non-adjectival, non-ordinal rating scale and thereby relate his 
psychological continuum to a linear numerical index. Twenty-five 
test subjects utilized such a scale and 81.8% had correlation factors 
in excess of 0.953 during three tests. 

The use of a non-adjectival, non-ordinal scale could provide 
simplicity, linearity, averaging, high correlation and a high confidence 
for minimum testing. Such a scale, if used in contemporary testing, 
might greatly reduce evaluation and procurement costs. 
APPENDIX A 
RATER DATA 
APPENDIX B 

LIST OF EVALUATION PHRASES 

PHRASE                                                  PSYCHOLOGICAL 
                                                        MEAN 

Handling Qualities 
  Excellent handling qualities                          1.00 
  Highly desirable handling qualities                   1.47 
  Good handling qualities                               Za8 
  Pleasant handling qualities                           2.65 
  Fair handling qualities                               4.13 
  Bad handling qualities                                7.74 
  Very bad handling qualities                           Bice 

Control 
  Extremely easy to control with excellent 
    precision                                           0.97 
  Very easy to control with good precision              1.76 
  Easy to control with fair precision                   ed 
  Controllable with somewhat inadequate precision       22435 
  Controllable, but only very imprecisely               G,05 
  Difficult to control                                  rele: 
  Very difficult to control                             Sao 
  Nearly uncontrollable                                 S201 

Precision 
  Extremely easy to control with excellent 
    precision                                           Oso 
  Very easy to control with good precision              ao 
  Easy to control with fair precision                   eae 
  Controllable with somewhat inadequate 
    precision                                           S45 
  Controllable, but only very imprecisely               6.65 

Response Characteristics 
  Excellent, pure (i.e., no accidental 
    excitation) primary and secondary 
    response characteristics                            0.99 
  Good, relatively pure, primary and secondary 
    response characteristics                            2.47 
  Fair, somewhat impure primary or secondary 
    response characteristics                            4.62 
  Quite sensitive, sluggish or uncontrollable 
    in primary or secondary responses                   6.00 
  Extremely sensitive, sluggish or uncontrollable 
    in primary or secondary responses                   7.10 

Effects of Deficiencies 
  Effects of deficiencies on performance is easily 
    compensated for by pilot                            4.04 
  Some minor but annoying deficiencies                  4.50 
  Moderately objectionable deficiencies                 Dod 
  Major, very objectionable deficiencies                1-05 

Demands on Pilot 
  Completely undemanding of pilots, very relaxed 
    and comfortable                                     1.63 
  Largely undemanding of pilots, relaxed                2356 
  Mildly demanding of pilot attention, skill 
    or effort                                           4.22 
  Demanding of pilot attention, skill or 
    effort                                              5.88 
  Very demanding of pilot attention, skill or 
    effort                                              Tale 
  Completely demanding of pilot attention, 
    skill or effort                                     $36 
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